This project explores several clustering algorithms for segmenting data using a range of Python packages. The models applied to cluster high-dimensional data were the K-Means, Affinity Propagation, Mean Shift, Spectral Clustering and Agglomerative Clustering algorithms. The algorithms were evaluated using the silhouette coefficient, which measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation), and therefore how well-separated the clusters are. All results were consolidated in a Summary presented at the end of the document.
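As a minimal sketch of how the silhouette coefficient behaves, the snippet below scores a K-Means partition of synthetic data. The toy data generated by `make_blobs` and the names `X_toy`, `toy_labels` and `toy_silhouette` are assumptions made for illustration only and are not part of the study's dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated data (illustrative only, not the cancer dataset)
X_toy, _ = make_blobs(n_samples=150,
                      centers=[[-10, -10], [0, 0], [10, 10]],
                      cluster_std=0.5,
                      random_state=42)

# Fit K-Means with three clusters and score the resulting partition
toy_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_toy)
toy_silhouette = silhouette_score(X_toy, toy_labels)

# Values close to 1 indicate cohesive, well-separated clusters;
# values near 0 indicate overlapping clusters
print(f"Silhouette coefficient: {toy_silhouette:.4f}")
```

The coefficient ranges from -1 to 1, so it can be compared across algorithms that produce different numbers of clusters.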
Cluster analysis is an unsupervised learning method aimed at identifying similar structural patterns in an unlabeled data set by segmenting the observations into clusters whose members share more characteristics with one another than with members of other clusters. The algorithms applied in this study partition the data set through hierarchical methods (either agglomerative, where smaller clusters are merged into larger ones, or divisive, where larger clusters are split into smaller ones) and non-hierarchical methods (where each observation is placed in exactly one of a set of mutually exclusive clusters).
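The distinction between the two families can be sketched on a tiny example. The six one-dimensional points and all variable names below are hypothetical, chosen only so that both approaches recover the same two groups.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Two tight groups of points on a line (illustrative data only)
X_toy = np.array([[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]])

# Hierarchical (agglomerative): start from singletons and merge upward
agglo_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_toy)

# Non-hierarchical (partitional): assign each point to exactly one of k clusters
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_toy)

# Label numbering is arbitrary, so compare co-membership rather than raw labels
agglo_pairs = agglo_labels[:, None] == agglo_labels[None, :]
kmeans_pairs = kmeans_labels[:, None] == kmeans_labels[None, :]
same_grouping = bool((agglo_pairs == kmeans_pairs).all())
print("Same partition:", same_grouping)
```

On data with less obvious structure the two families can disagree, which is one reason to compare them with a common metric.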
Datasets used for the analysis were separately gathered and consolidated from various sources including:
This study hypothesized that various death rates by major cancer types contain inherent patterns and structures within the data, enabling the grouping of similar countries and the differentiation of dissimilar ones.
Due to the unsupervised learning nature of the analysis, there is no target variable defined for the study.
The clustering descriptor variables for the study are:
The target descriptor variables for the study are:
The metadata variables for the study are:
##################################
# Installing geopandas package
##################################
# !pip install geopandas
##################################
# Setting the Python Environment
##################################
import os
os.environ["OMP_NUM_THREADS"] = '1'
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
%matplotlib inline
from operator import add,mul,truediv
from sklearn.preprocessing import PowerTransformer, StandardScaler
from scipy import stats
from sklearn.cluster import KMeans, AffinityPropagation, MeanShift, SpectralClustering, AgglomerativeClustering, Birch, BisectingKMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import geopandas as gpd
##################################
# Setting Global Options
##################################
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
##################################
# Loading the dataset
##################################
cancer_death_rate = pd.read_csv('CancerDeathsByCountryCode.csv')
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate.shape)
Dataset Dimensions:
(208, 16)
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(cancer_death_rate.dtypes)
Column Names and Data Types:
COUNTRY     object
CODE        object
PROCAN     float64
BRECAN     float64
CERCAN     float64
STOCAN     float64
ESOCAN     float64
PANCAN     float64
LUNCAN     float64
COLCAN     float64
LIVCAN     float64
SMPREV     float64
OWPREV     float64
ACSHAR     float64
GEOLAT     float64
GEOLON     float64
dtype: object
##################################
# Taking a snapshot of the dataset
##################################
cancer_death_rate.head()
| | COUNTRY | CODE | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | SMPREV | OWPREV | ACSHAR | GEOLAT | GEOLON |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 6.3700 | 8.6700 | 3.9000 | 29.3000 | 6.9600 | 2.7200 | 12.5300 | 8.4300 | 10.2700 | 11.9000 | 23.0000 | 0.2100 | 33.9391 | 67.7100 |
| 1 | Albania | ALB | 8.8700 | 6.5000 | 1.6400 | 10.6800 | 1.4400 | 6.6800 | 26.6300 | 9.1500 | 6.8400 | 20.5000 | 57.7000 | 7.1700 | 41.1533 | 20.1683 |
| 2 | Algeria | DZA | 5.3300 | 7.5800 | 2.1800 | 5.1000 | 1.1500 | 4.2700 | 10.4600 | 8.0500 | 2.2000 | 11.2000 | 62.0000 | 0.9500 | 28.0339 | 1.6596 |
| 3 | American Samoa | ASM | 20.9400 | 16.8100 | 5.0200 | 15.7900 | 1.5200 | 5.1900 | 28.0100 | 16.5500 | 7.0200 | NaN | NaN | NaN | -14.2710 | -170.1322 |
| 4 | Andorra | AND | 9.6800 | 9.0200 | 2.0400 | 8.3000 | 3.5600 | 10.2600 | 34.1800 | 22.9700 | 9.4400 | 26.6000 | 63.7000 | 11.0200 | 42.5462 | 1.6016 |
##################################
# Performing a general exploration of the numeric variables
##################################
if (len(cancer_death_rate.select_dtypes(include='number').columns)==0):
    print('No numeric columns identified from the data.')
else:
    print('Numeric Variable Summary:')
    display(cancer_death_rate.describe(include='number').transpose())
Numeric Variable Summary:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PROCAN | 208.0000 | 11.7260 | 7.6965 | 2.8100 | 6.5875 | 10.0050 | 13.9900 | 54.1500 |
| BRECAN | 208.0000 | 11.3350 | 4.3649 | 4.6900 | 8.3975 | 10.5600 | 13.0950 | 37.1000 |
| CERCAN | 208.0000 | 6.0651 | 5.1204 | 0.7100 | 1.8575 | 4.4800 | 9.0575 | 39.9500 |
| STOCAN | 208.0000 | 10.5975 | 5.8993 | 3.4000 | 6.6350 | 9.1550 | 13.6725 | 46.0400 |
| ESOCAN | 208.0000 | 4.8946 | 4.1320 | 0.9600 | 2.3350 | 3.3100 | 5.4150 | 25.7600 |
| PANCAN | 208.0000 | 6.6004 | 3.0552 | 1.6000 | 4.2300 | 6.1150 | 8.7450 | 19.2900 |
| LUNCAN | 208.0000 | 21.0217 | 11.4489 | 5.9500 | 11.3800 | 20.0200 | 27.5125 | 78.2300 |
| COLCAN | 208.0000 | 13.6945 | 5.5475 | 4.9400 | 9.2775 | 12.7950 | 17.1325 | 31.3800 |
| LIVCAN | 208.0000 | 5.9826 | 9.0501 | 0.6500 | 2.8400 | 3.8950 | 6.0750 | 115.2300 |
| SMPREV | 186.0000 | 17.0140 | 8.0416 | 3.3000 | 10.4250 | 16.4000 | 22.8500 | 41.1000 |
| OWPREV | 191.0000 | 48.9963 | 17.0164 | 18.3000 | 31.2500 | 55.0000 | 60.9000 | 88.5000 |
| ACSHAR | 187.0000 | 6.0013 | 4.1502 | 0.0030 | 2.2750 | 5.7000 | 9.2500 | 20.5000 |
| GEOLAT | 208.0000 | 19.0381 | 24.3776 | -40.9006 | 4.1377 | 17.3443 | 40.0876 | 71.7069 |
| GEOLON | 208.0000 | 16.2690 | 71.9576 | -175.1982 | -11.1506 | 19.4388 | 47.8118 | 179.4144 |
##################################
# Performing a general exploration of the object variable
##################################
if (len(cancer_death_rate.select_dtypes(include='object').columns)==0):
    print('No object columns identified from the data.')
else:
    print('Object Variable Summary:')
    display(cancer_death_rate.describe(include='object').transpose())
Object Variable Summary:
| | count | unique | top | freq |
|---|---|---|---|---|
| COUNTRY | 208 | 208 | Afghanistan | 1 |
| CODE | 203 | 203 | AFG | 1 |
##################################
# Performing a general exploration of the categorical variables
##################################
if (len(cancer_death_rate.select_dtypes(include='category').columns)==0):
    print('No categorical columns identified from the data.')
else:
    print('Categorical Variable Summary:')
    display(cancer_death_rate.describe(include='category').transpose())
No categorical columns identified from the data.
Data quality findings based on assessment are as follows:
##################################
# Counting the number of duplicated rows
##################################
cancer_death_rate.duplicated().sum()
0
##################################
# Gathering the data types for each column
##################################
data_type_list = list(cancer_death_rate.dtypes)
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(cancer_death_rate.columns)
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(cancer_death_rate)] * len(cancer_death_rate.columns))
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(cancer_death_rate.isna().sum(axis=0))
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(cancer_death_rate.count())
##################################
# Gathering the fill rate (proportion of non-missing data) for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
data_type_list,
row_count_list,
non_null_count_list,
null_count_list,
fill_rate_list),
columns=['Column.Name',
'Column.Type',
'Row.Count',
'Non.Null.Count',
'Null.Count',
'Fill.Rate'])
display(all_column_quality_summary)
| | Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate |
|---|---|---|---|---|---|---|
| 0 | COUNTRY | object | 208 | 208 | 0 | 1.0000 |
| 1 | CODE | object | 208 | 203 | 5 | 0.9760 |
| 2 | PROCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 3 | BRECAN | float64 | 208 | 208 | 0 | 1.0000 |
| 4 | CERCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 5 | STOCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 6 | ESOCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 7 | PANCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 8 | LUNCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 9 | COLCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 10 | LIVCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 11 | SMPREV | float64 | 208 | 186 | 22 | 0.8942 |
| 12 | OWPREV | float64 | 208 | 191 | 17 | 0.9183 |
| 13 | ACSHAR | float64 | 208 | 187 | 21 | 0.8990 |
| 14 | GEOLAT | float64 | 208 | 208 | 0 | 1.0000 |
| 15 | GEOLON | float64 | 208 | 208 | 0 | 1.0000 |
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
4
##################################
# Identifying the columns
# with Fill.Rate < 1.00
##################################
if (len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])==0):
    print('No columns with Fill.Rate < 1.00.')
else:
    display(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)].sort_values(by=['Fill.Rate'], ascending=True))
| | Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate |
|---|---|---|---|---|---|---|
| 11 | SMPREV | float64 | 208 | 186 | 22 | 0.8942 |
| 13 | ACSHAR | float64 | 208 | 187 | 21 | 0.8990 |
| 12 | OWPREV | float64 | 208 | 191 | 17 | 0.9183 |
| 1 | CODE | object | 208 | 203 | 5 | 0.9760 |
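The column-level fill rate built step by step above can also be obtained in a single pandas expression. The following is a compact sketch on a hypothetical three-column frame (`toy_df` and its values are invented for illustration), not a replacement for the summary table.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, not the study data)
toy_df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [np.nan, np.nan, 3.0, 4.0],
})

# Share of non-missing values per column (the mean of a boolean mask)
toy_fill_rate = toy_df.notna().mean()

# Keep only the columns that are not completely filled, lowest fill rate first
toy_low_fill = toy_fill_rate[toy_fill_rate < 1].sort_values()
print(toy_low_fill)
```

Because `notna()` returns booleans, taking the mean gives the proportion of filled cells directly, without separate count lists.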
##################################
# Identifying the columns
# with Fill.Rate < 1.00
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1.00)]
##################################
# Gathering the metadata labels for each observation
##################################
row_metadata_list = cancer_death_rate["COUNTRY"].values.tolist()
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(cancer_death_rate.columns)] * len(cancer_death_rate))
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(cancer_death_rate.isna().sum(axis=1))
##################################
# Gathering the missing data rate for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_metadata_list,
column_count_list,
null_row_list,
missing_rate_list),
columns=['Row.Name',
'Column.Count',
'Null.Count',
'Missing.Rate'])
display(all_row_quality_summary)
| | Row.Name | Column.Count | Null.Count | Missing.Rate |
|---|---|---|---|---|
| 0 | Afghanistan | 16 | 0 | 0.0000 |
| 1 | Albania | 16 | 0 | 0.0000 |
| 2 | Algeria | 16 | 0 | 0.0000 |
| 3 | American Samoa | 16 | 3 | 0.1875 |
| 4 | Andorra | 16 | 0 | 0.0000 |
| ... | ... | ... | ... | ... |
| 203 | Vietnam | 16 | 0 | 0.0000 |
| 204 | Wales | 16 | 4 | 0.2500 |
| 205 | Yemen | 16 | 0 | 0.0000 |
| 206 | Zambia | 16 | 0 | 0.0000 |
| 207 | Zimbabwe | 16 | 0 | 0.0000 |
208 rows × 4 columns
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
25
##################################
# Identifying the rows
# with Missing.Rate > 0.00
##################################
row_missing_rate = all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)]
##################################
# Identifying the rows
# with Missing.Rate > 0.00
##################################
if (len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])==0):
    print('No rows with Missing.Rate > 0.00.')
else:
    display(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)].sort_values(by=['Missing.Rate'], ascending=False))
| | Row.Name | Column.Count | Null.Count | Missing.Rate |
|---|---|---|---|---|
| 204 | Wales | 16 | 4 | 0.2500 |
| 135 | Northern Ireland | 16 | 4 | 0.2500 |
| 57 | England | 16 | 4 | 0.2500 |
| 186 | Tokelau | 16 | 4 | 0.2500 |
| 161 | Scotland | 16 | 4 | 0.2500 |
| 198 | United States Virgin Islands | 16 | 3 | 0.1875 |
| 173 | South Sudan | 16 | 3 | 0.1875 |
| 158 | San Marino | 16 | 3 | 0.1875 |
| 149 | Puerto Rico | 16 | 3 | 0.1875 |
| 20 | Bermuda | 16 | 3 | 0.1875 |
| 3 | American Samoa | 16 | 3 | 0.1875 |
| 118 | Monaco | 16 | 3 | 0.1875 |
| 74 | Guam | 16 | 3 | 0.1875 |
| 72 | Greenland | 16 | 3 | 0.1875 |
| 136 | Northern Mariana Islands | 16 | 3 | 0.1875 |
| 132 | Niue | 16 | 2 | 0.1250 |
| 140 | Palau | 16 | 2 | 0.1250 |
| 141 | Palestine | 16 | 2 | 0.1250 |
| 181 | Taiwan | 16 | 2 | 0.1250 |
| 41 | Cook Islands | 16 | 2 | 0.1250 |
| 125 | Nauru | 16 | 1 | 0.0625 |
| 154 | Saint Kitts and Nevis | 16 | 1 | 0.0625 |
| 116 | Micronesia | 16 | 1 | 0.0625 |
| 112 | Marshall Islands | 16 | 1 | 0.0625 |
| 192 | Tuvalu | 16 | 1 | 0.0625 |
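The row-level missing rate can be sketched the same way by averaging the missing-value mask across columns. The frame below is hypothetical. Unlike the summary above, which counts all 16 columns including `COUNTRY`, this sketch moves the label into the index, so the denominator covers only the value columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one label column and three value columns
toy_df = pd.DataFrame({
    "COUNTRY": ["X", "Y", "Z"],
    "RATE_1": [1.0, np.nan, 3.0],
    "RATE_2": [np.nan, np.nan, 6.0],
    "RATE_3": [7.0, 8.0, 9.0],
}).set_index("COUNTRY")

# Proportion of missing cells per row
toy_row_missing_rate = toy_df.isna().mean(axis=1)

# Rows with at least one missing value, worst first
toy_rows_with_gaps = toy_row_missing_rate[toy_row_missing_rate > 0].sort_values(ascending=False)
print(toy_rows_with_gaps)
```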
##################################
# Formulating the dataset
# with numeric columns only
##################################
cancer_death_rate_numeric = cancer_death_rate.select_dtypes(include='number')
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = cancer_death_rate_numeric.columns
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = cancer_death_rate_numeric.min()
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = cancer_death_rate_numeric.mean()
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = cancer_death_rate_numeric.median()
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = cancer_death_rate_numeric.max()
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0] for x in cancer_death_rate_numeric]
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1] for x in cancer_death_rate_numeric]
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [cancer_death_rate_numeric[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_death_rate_numeric]
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [cancer_death_rate_numeric[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_death_rate_numeric]
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = cancer_death_rate_numeric.nunique(dropna=True)
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(cancer_death_rate_numeric)] * len(cancer_death_rate_numeric.columns))
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = cancer_death_rate_numeric.skew()
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = cancer_death_rate_numeric.kurtosis()
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_minimum_list,
numeric_mean_list,
numeric_median_list,
numeric_maximum_list,
numeric_first_mode_list,
numeric_second_mode_list,
numeric_first_mode_count_list,
numeric_second_mode_count_list,
numeric_first_second_mode_ratio_list,
numeric_unique_count_list,
numeric_row_count_list,
numeric_unique_count_ratio_list,
numeric_skewness_list,
numeric_kurtosis_list),
columns=['Numeric.Column.Name',
'Minimum',
'Mean',
'Median',
'Maximum',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio',
'Skewness',
'Kurtosis'])
if (len(cancer_death_rate_numeric.columns)==0):
    print('No numeric columns identified from the data.')
else:
    display(numeric_column_quality_summary)
| | Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PROCAN | 2.8100 | 11.7260 | 10.0050 | 54.1500 | 15.4100 | 9.2300 | 2 | 2 | 1.0000 | 198 | 208 | 0.9519 | 2.1250 | 6.1837 |
| 1 | BRECAN | 4.6900 | 11.3350 | 10.5600 | 37.1000 | 10.2900 | 8.9900 | 3 | 2 | 1.5000 | 190 | 208 | 0.9135 | 1.5844 | 5.4634 |
| 2 | CERCAN | 0.7100 | 6.0651 | 4.4800 | 39.9500 | 4.6200 | 1.5200 | 3 | 3 | 1.0000 | 189 | 208 | 0.9087 | 1.9715 | 8.3399 |
| 3 | STOCAN | 3.4000 | 10.5975 | 9.1550 | 46.0400 | 7.0200 | 6.5800 | 2 | 2 | 1.0000 | 196 | 208 | 0.9423 | 2.0526 | 7.3909 |
| 4 | ESOCAN | 0.9600 | 4.8946 | 3.3100 | 25.7600 | 2.5200 | 1.6800 | 3 | 3 | 1.0000 | 180 | 208 | 0.8654 | 2.0659 | 5.2990 |
| 5 | PANCAN | 1.6000 | 6.6004 | 6.1150 | 19.2900 | 3.1300 | 3.0700 | 3 | 2 | 1.5000 | 187 | 208 | 0.8990 | 0.9127 | 1.5264 |
| 6 | LUNCAN | 5.9500 | 21.0217 | 20.0200 | 78.2300 | 10.7500 | 11.6200 | 3 | 2 | 1.5000 | 200 | 208 | 0.9615 | 1.2646 | 2.8631 |
| 7 | COLCAN | 4.9400 | 13.6945 | 12.7950 | 31.3800 | 10.9000 | 12.2900 | 2 | 2 | 1.0000 | 199 | 208 | 0.9567 | 0.7739 | 0.1459 |
| 8 | LIVCAN | 0.6500 | 5.9826 | 3.8950 | 115.2300 | 2.7500 | 2.7400 | 6 | 4 | 1.5000 | 173 | 208 | 0.8317 | 9.1131 | 104.2327 |
| 9 | SMPREV | 3.3000 | 17.0140 | 16.4000 | 41.1000 | 22.4000 | 26.5000 | 4 | 4 | 1.0000 | 141 | 208 | 0.6779 | 0.4096 | -0.4815 |
| 10 | OWPREV | 18.3000 | 48.9963 | 55.0000 | 88.5000 | 61.6000 | 28.4000 | 5 | 3 | 1.6667 | 157 | 208 | 0.7548 | -0.1617 | -0.9762 |
| 11 | ACSHAR | 0.0030 | 6.0013 | 5.7000 | 20.5000 | 0.6900 | 12.0300 | 3 | 2 | 1.5000 | 177 | 208 | 0.8510 | 0.3532 | -0.5657 |
| 12 | GEOLAT | -40.9006 | 19.0381 | 17.3443 | 71.7069 | 55.3781 | 53.4129 | 2 | 2 | 1.0000 | 206 | 208 | 0.9904 | -0.1861 | -0.6520 |
| 13 | GEOLON | -175.1982 | 16.2690 | 19.4388 | 179.4144 | -3.4360 | -8.2439 | 2 | 2 | 1.0000 | 206 | 208 | 0.9904 | -0.2025 | 0.3981 |
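The first-to-second mode ratio in the table above flags columns dominated by a single value: a large ratio means the most frequent value occurs far more often than the runner-up. A minimal sketch of the same calculation as a reusable helper is shown below. The function name `first_second_mode_ratio` and the toy series are hypothetical.

```python
import pandas as pd

def first_second_mode_ratio(series: pd.Series) -> float:
    """Count of the most frequent value divided by the count of the runner-up."""
    # value_counts sorts by frequency in descending order by default
    counts = series.value_counts(dropna=True)
    if len(counts) < 2:
        # The ratio is undefined with fewer than two distinct values
        return float("nan")
    return counts.iloc[0] / counts.iloc[1]

# Six occurrences of 5 against two occurrences of 7 gives a ratio of 3
toy_series = pd.Series([5, 5, 5, 5, 5, 5, 7, 7, 9])
toy_ratio = first_second_mode_ratio(toy_series)
print(toy_ratio)
```

The guard for fewer than two distinct values avoids the index error that positional access would otherwise raise on a constant column.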
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Identifying the numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
if (len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])==0):
    print('No numeric columns with First.Second.Mode.Ratio > 5.00.')
else:
    display(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
No numeric columns with First.Second.Mode.Ratio > 5.00.
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3)|(numeric_column_quality_summary['Skewness']<(-3))])
1
##################################
# Identifying the numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
if (len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])==0):
    print('No numeric columns with Skewness > 3.00 or Skewness < -3.00.')
else:
    display(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))].sort_values(by=['Skewness'], ascending=False))
| | Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | LIVCAN | 0.6500 | 5.9826 | 3.8950 | 115.2300 | 2.7500 | 2.7400 | 6 | 4 | 1.5000 | 173 | 208 | 0.8317 | 9.1131 | 104.2327 |
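A heavily right-skewed variable such as the one flagged above can distort distance-based clustering, because a few extreme values dominate the scale. One common remedy is a power transformation. The sketch below applies the Yeo-Johnson method from `PowerTransformer` (which is among the imports at the top of this document) to synthetic log-normal data. Whether and how the study applies it is not shown in this section, so treat this purely as an illustration with hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed sample (illustrative only, not the LIVCAN values)
rng = np.random.default_rng(42)
toy_skewed = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=500))

# Yeo-Johnson power transform, followed by standardization
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
toy_transformed = pd.Series(transformer.fit_transform(toy_skewed.to_frame()).ravel())

# Compare sample skewness before and after the transformation
print(f"Skewness before: {toy_skewed.skew():.4f}")
print(f"Skewness after:  {toy_transformed.skew():.4f}")
```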
##################################
# Formulating the dataset
# with object column only
##################################
cancer_death_rate_object = cancer_death_rate.select_dtypes(include='object')
##################################
# Gathering the variable names for the object column
##################################
object_variable_name_list = cancer_death_rate_object.columns
##################################
# Gathering the first mode values for the object column
##################################
object_first_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[0] for x in cancer_death_rate_object]
##################################
# Gathering the second mode values for each object column
##################################
object_second_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[1] for x in cancer_death_rate_object]
##################################
# Gathering the count of first mode values for each object column
##################################
object_first_mode_count_list = [cancer_death_rate_object[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_death_rate_object]
##################################
# Gathering the count of second mode values for each object column
##################################
object_second_mode_count_list = [cancer_death_rate_object[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_death_rate_object]
##################################
# Gathering the first mode to second mode ratio for each object column
##################################
object_first_second_mode_ratio_list = map(truediv, object_first_mode_count_list, object_second_mode_count_list)
##################################
# Gathering the count of unique values for each object column
##################################
object_unique_count_list = cancer_death_rate_object.nunique(dropna=True)
##################################
# Gathering the number of observations for each object column
##################################
object_row_count_list = list([len(cancer_death_rate_object)] * len(cancer_death_rate_object.columns))
##################################
# Gathering the unique to count ratio for each object column
##################################
object_unique_count_ratio_list = map(truediv, object_unique_count_list, object_row_count_list)
object_column_quality_summary = pd.DataFrame(zip(object_variable_name_list,
object_first_mode_list,
object_second_mode_list,
object_first_mode_count_list,
object_second_mode_count_list,
object_first_second_mode_ratio_list,
object_unique_count_list,
object_row_count_list,
object_unique_count_ratio_list),
columns=['Object.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
if (len(cancer_death_rate_object.columns)==0):
    print('No object columns identified from the data.')
else:
    display(object_column_quality_summary)
| | Object.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | COUNTRY | Afghanistan | Albania | 1 | 1 | 1.0000 | 208 | 208 | 1.0000 |
| 1 | CODE | AFG | PSX | 1 | 1 | 1.0000 | 203 | 208 | 0.9760 |
##################################
# Counting the number of object columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(object_column_quality_summary[(object_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Counting the number of object columns
# with Unique.Count.Ratio > 10.00
##################################
len(object_column_quality_summary[(object_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Formulating the dataset
# with categorical columns only
##################################
cancer_death_rate_categorical = cancer_death_rate.select_dtypes(include='category')
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = cancer_death_rate_categorical.columns
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[0] for x in cancer_death_rate_categorical]
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[1] for x in cancer_death_rate_categorical]
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [cancer_death_rate_categorical[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_death_rate_categorical]
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [cancer_death_rate_categorical[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_death_rate_categorical]
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = cancer_death_rate_categorical.nunique(dropna=True)
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(cancer_death_rate_categorical)] * len(cancer_death_rate_categorical.columns))
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
categorical_first_mode_list,
categorical_second_mode_list,
categorical_first_mode_count_list,
categorical_second_mode_count_list,
categorical_first_second_mode_ratio_list,
categorical_unique_count_list,
categorical_row_count_list,
categorical_unique_count_ratio_list),
columns=['Categorical.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
if (len(cancer_death_rate_categorical.columns)==0):
    print('No categorical columns identified from the data.')
else:
    display(categorical_column_quality_summary)
No categorical columns identified from the data.
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Performing a general exploration of the original dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate.shape)
Dataset Dimensions:
(208, 16)
##################################
# Filtering out the rows
# with Missing.Rate > 0.00
##################################
cancer_death_rate_filtered_row = cancer_death_rate.drop(cancer_death_rate[cancer_death_rate.COUNTRY.isin(row_missing_rate['Row.Name'].values.tolist())].index)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_filtered_row.shape)
Dataset Dimensions:
(183, 16)
##################################
# Re-evaluating the missing data summary
# for the filtered data
##################################
variable_name_list = list(cancer_death_rate_filtered_row.columns)
null_count_list = list(cancer_death_rate_filtered_row.isna().sum(axis=0))
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
null_count_list),
columns=['Column.Name',
'Null.Count'])
display(all_column_quality_summary)
| | Column.Name | Null.Count |
|---|---|---|
| 0 | COUNTRY | 0 |
| 1 | CODE | 0 |
| 2 | PROCAN | 0 |
| 3 | BRECAN | 0 |
| 4 | CERCAN | 0 |
| 5 | STOCAN | 0 |
| 6 | ESOCAN | 0 |
| 7 | PANCAN | 0 |
| 8 | LUNCAN | 0 |
| 9 | COLCAN | 0 |
| 10 | LIVCAN | 0 |
| 11 | SMPREV | 0 |
| 12 | OWPREV | 0 |
| 13 | ACSHAR | 0 |
| 14 | GEOLAT | 0 |
| 15 | GEOLON | 0 |
##################################
# Identifying the columns
# with Null.Count > 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Null.Count']>1.00)])
0
##################################
# Formulating a new dataset object
# for the cleaned data
##################################
cancer_death_rate_cleaned = cancer_death_rate_filtered_row
cancer_death_rate_cleaned.reset_index(drop=True,inplace=True)
##################################
# Performing a general exploration of the cleaned dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_cleaned.shape)
Dataset Dimensions:
(183, 16)
##################################
# Formulating the cleaned dataset
# with geolocation data
##################################
cancer_death_rate_cleaned_numeric = cancer_death_rate_cleaned.select_dtypes(include='number')
cancer_death_rate_cleaned_numeric_geolocation = cancer_death_rate_cleaned_numeric[['GEOLAT','GEOLON']]
##################################
# Formulating the cleaned dataset
# with numeric columns only
# without the geolocation data
##################################
cancer_death_rate_cleaned_numeric.drop(['GEOLAT','GEOLON'], inplace=True, axis=1)
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = list(cancer_death_rate_cleaned_numeric.columns)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = cancer_death_rate_cleaned_numeric.skew()
##################################
# Computing the interquartile range
# for all columns
##################################
cancer_death_rate_cleaned_numeric_q1 = cancer_death_rate_cleaned_numeric.quantile(0.25)
cancer_death_rate_cleaned_numeric_q3 = cancer_death_rate_cleaned_numeric.quantile(0.75)
cancer_death_rate_cleaned_numeric_iqr = cancer_death_rate_cleaned_numeric_q3 - cancer_death_rate_cleaned_numeric_q1
##################################
# Gathering the outlier count for each numeric column
# based on the interquartile range criterion
##################################
numeric_outlier_count_list = ((cancer_death_rate_cleaned_numeric < (cancer_death_rate_cleaned_numeric_q1 - 1.5 * cancer_death_rate_cleaned_numeric_iqr)) | (cancer_death_rate_cleaned_numeric > (cancer_death_rate_cleaned_numeric_q3 + 1.5 * cancer_death_rate_cleaned_numeric_iqr))).sum()
##################################
# Gathering the number of observations for each column
##################################
numeric_row_count_list = list([len(cancer_death_rate_cleaned_numeric)] * len(cancer_death_rate_cleaned_numeric.columns))
##################################
# Gathering the outlier to row count ratio for each numeric column
##################################
numeric_outlier_ratio_list = map(truediv, numeric_outlier_count_list, numeric_row_count_list)
##################################
# Formulating the outlier summary
# for all numeric columns
##################################
numeric_column_outlier_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_skewness_list,
numeric_outlier_count_list,
numeric_row_count_list,
numeric_outlier_ratio_list),
columns=['Numeric.Column.Name',
'Skewness',
'Outlier.Count',
'Row.Count',
'Outlier.Ratio'])
display(numeric_column_outlier_summary)
| | Numeric.Column.Name | Skewness | Outlier.Count | Row.Count | Outlier.Ratio |
|---|---|---|---|---|---|
| 0 | PROCAN | 2.2461 | 11 | 183 | 0.0601 |
| 1 | BRECAN | 1.9575 | 8 | 183 | 0.0437 |
| 2 | CERCAN | 1.9896 | 2 | 183 | 0.0109 |
| 3 | STOCAN | 2.0858 | 6 | 183 | 0.0328 |
| 4 | ESOCAN | 2.0918 | 24 | 183 | 0.1311 |
| 5 | PANCAN | 0.5992 | 1 | 183 | 0.0055 |
| 6 | LUNCAN | 0.8574 | 2 | 183 | 0.0109 |
| 7 | COLCAN | 0.8201 | 2 | 183 | 0.0109 |
| 8 | LIVCAN | 8.7158 | 19 | 183 | 0.1038 |
| 9 | SMPREV | 0.4165 | 0 | 183 | 0.0000 |
| 10 | OWPREV | -0.3341 | 0 | 183 | 0.0000 |
| 11 | ACSHAR | 0.3372 | 1 | 183 | 0.0055 |
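The interquartile range criterion used for the outlier counts above can be illustrated on a small hypothetical list of values. This is a minimal standard-library sketch, not the notebook's pandas code: `iqr_outliers` and `toy_rates` are illustrative names, and `statistics.quantiles(..., method='inclusive')` is assumed to give the same linearly interpolated quartiles as the default `DataFrame.quantile` call.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # method='inclusive' interpolates linearly between order statistics
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical values, not taken from the cancer death rate data
toy_rates = [1, 2, 3, 4, 5, 6, 7, 8, 100]
toy_outliers = iqr_outliers(toy_rates)

# Outlier ratio = outlier count divided by the number of observations
toy_outlier_ratio = len(toy_outliers) / len(toy_rates)
```

Here Q1 is 3 and Q3 is 7, so the fences sit at -3 and 13 and only the value 100 is flagged, giving a ratio of one in nine.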
##################################
# Formulating the individual boxplots
# for all numeric columns
##################################
for column in cancer_death_rate_cleaned_numeric:
    plt.figure(figsize=(17,1))
    sns.boxplot(data=cancer_death_rate_cleaned_numeric, x=column)
##################################
# Formulating a function
# to plot the correlation matrix
# for all pairwise combinations
# of numeric columns
##################################
def plot_correlation_matrix(corr, mask=None):
    f, ax = plt.subplots(figsize=(11, 9))
    sns.heatmap(corr,
                ax=ax,
                mask=mask,
                annot=True,
                vmin=-1,
                vmax=1,
                center=0,
                cmap='coolwarm',
                linewidths=1,
                linecolor='gray',
                cbar_kws={'orientation': 'horizontal'})
##################################
# Computing the correlation coefficients
# and correlation p-values
# among pairs of numeric columns
##################################
cancer_death_rate_cleaned_numeric_correlation_pairs = {}
cancer_death_rate_cleaned_numeric_columns = cancer_death_rate_cleaned_numeric.columns.tolist()
for numeric_column_a, numeric_column_b in itertools.combinations(cancer_death_rate_cleaned_numeric_columns, 2):
    cancer_death_rate_cleaned_numeric_correlation_pairs[numeric_column_a + '_' + numeric_column_b] = stats.pearsonr(
        cancer_death_rate_cleaned_numeric.loc[:, numeric_column_a],
        cancer_death_rate_cleaned_numeric.loc[:, numeric_column_b])
##################################
# Formulating the pairwise correlation summary
# for all numeric columns
##################################
cancer_death_rate_cleaned_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_cleaned_numeric_correlation_pairs, orient='index')
cancer_death_rate_cleaned_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_cleaned_numeric_summary.sort_values(by=['Pearson.Correlation.Coefficient'], ascending=False).head(20))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| PANCAN_COLCAN | 0.7537 | 0.0000 |
| LUNCAN_COLCAN | 0.7010 | 0.0000 |
| LUNCAN_SMPREV | 0.6415 | 0.0000 |
| PANCAN_LUNCAN | 0.6367 | 0.0000 |
| COLCAN_ACSHAR | 0.5819 | 0.0000 |
| PANCAN_ACSHAR | 0.5750 | 0.0000 |
| PANCAN_OWPREV | 0.5212 | 0.0000 |
| CERCAN_ESOCAN | 0.4803 | 0.0000 |
| LUNCAN_ACSHAR | 0.4330 | 0.0000 |
| STOCAN_LIVCAN | 0.4291 | 0.0000 |
| SMPREV_OWPREV | 0.4164 | 0.0000 |
| COLCAN_SMPREV | 0.4126 | 0.0000 |
| COLCAN_OWPREV | 0.4102 | 0.0000 |
| LUNCAN_OWPREV | 0.4087 | 0.0000 |
| PROCAN_BRECAN | 0.4081 | 0.0000 |
| PROCAN_CERCAN | 0.3650 | 0.0000 |
| PANCAN_SMPREV | 0.3603 | 0.0000 |
| BRECAN_CERCAN | 0.3589 | 0.0000 |
| ESOCAN_LIVCAN | 0.3009 | 0.0000 |
| CERCAN_STOCAN | 0.2790 | 0.0001 |
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric columns
##################################
cancer_death_rate_cleaned_numeric_correlation = cancer_death_rate_cleaned_numeric.corr()
mask = np.triu(cancer_death_rate_cleaned_numeric_correlation)
plot_correlation_matrix(cancer_death_rate_cleaned_numeric_correlation,mask)
plt.show()
##################################
# Formulating a function
# to compute the matrix of
# correlation p-values
# for all pairwise combinations
# of numeric columns
##################################
def correlation_significance(df=None):
    p_matrix = np.zeros(shape=(df.shape[1],df.shape[1]))
    for col in df.columns:
        for col2 in df.drop(col,axis=1).columns:
            _ , p = stats.pearsonr(df[col],df[col2])
            p_matrix[df.columns.to_list().index(col),df.columns.to_list().index(col2)] = p
    return p_matrix
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric columns
# with significant p-values only
##################################
cancer_death_rate_cleaned_numeric_correlation_p_values = correlation_significance(cancer_death_rate_cleaned_numeric)
mask = np.invert(np.tril(cancer_death_rate_cleaned_numeric_correlation_p_values<0.05))
plot_correlation_matrix(cancer_death_rate_cleaned_numeric_correlation,mask)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_cleaned_numeric.shape)
Dataset Dimensions:
(183, 12)
##################################
# Conducting a Yeo-Johnson Transformation
# to address the distributional
# shape of the variables
##################################
yeo_johnson_transformer = PowerTransformer(method='yeo-johnson',
standardize=False)
cancer_death_rate_cleaned_numeric_array = yeo_johnson_transformer.fit_transform(cancer_death_rate_cleaned_numeric)
##################################
# Formulating a new dataset object
# for the transformed data
##################################
cancer_death_rate_transformed_numeric = pd.DataFrame(cancer_death_rate_cleaned_numeric_array,
columns=cancer_death_rate_cleaned_numeric.columns)
##################################
# Formulating the individual boxplots
# for all transformed numeric columns
##################################
for column in cancer_death_rate_transformed_numeric:
    plt.figure(figsize=(17,1))
    sns.boxplot(data=cancer_death_rate_transformed_numeric, x=column)
##################################
# Performing a general exploration of the transformed dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_transformed_numeric.shape)
Dataset Dimensions:
(183, 12)
cancer_death_rate_transformed_numeric
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5595 | 1.5203 | 1.4836 | 2.1487 | 1.1939 | 1.5456 | 2.4417 | 1.9846 | 1.1035 | 4.7108 | 46.3969 | 0.2004 |
| 1 | 1.7272 | 1.4076 | 0.9307 | 1.7470 | 0.6928 | 2.6330 | 3.0570 | 2.0417 | 1.0378 | 6.4628 | 149.2448 | 3.8224 |
| 2 | 1.4670 | 1.4686 | 1.1002 | 1.4007 | 0.6154 | 2.0444 | 2.2954 | 1.9526 | 0.7708 | 4.5421 | 163.6112 | 0.7992 |
| 3 | 1.7704 | 1.5352 | 1.0595 | 1.6330 | 1.0009 | 3.2883 | 3.2603 | 2.6740 | 1.0911 | 7.4729 | 169.3720 | 5.1050 |
| 4 | 1.9120 | 1.6229 | 2.2285 | 1.6682 | 1.2514 | 1.7808 | 2.5816 | 2.0787 | 0.8161 | 3.8904 | 58.1402 | 3.7374 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178 | 1.9905 | 1.5533 | 1.9988 | 1.8183 | 0.8160 | 2.4509 | 2.8083 | 2.1890 | 0.7997 | 5.7294 | 168.3522 | 2.5885 |
| 179 | 1.2905 | 1.6500 | 1.6597 | 1.7169 | 0.9307 | 2.1167 | 3.0677 | 2.4898 | 0.8322 | 6.4983 | 34.8080 | 4.3473 |
| 180 | 1.5002 | 1.4362 | 1.0565 | 2.0113 | 1.0697 | 1.3550 | 2.3069 | 1.8210 | 0.8885 | 5.2531 | 120.5007 | 0.0504 |
| 181 | 1.9556 | 1.6259 | 2.3874 | 1.6536 | 1.3591 | 2.2641 | 2.3685 | 2.2649 | 0.8543 | 4.5909 | 58.9437 | 3.5868 |
| 182 | 2.1649 | 1.7314 | 2.5992 | 1.8555 | 1.3729 | 2.9086 | 2.5914 | 2.2817 | 1.1442 | 4.5666 | 88.2092 | 2.8255 |
183 rows × 12 columns
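For reference, the Yeo-Johnson mapping applied above can be sketched for a single value and a fixed power parameter. This is a standard-library illustration only: `yeo_johnson`, `toy_lambda` and `toy_values` are hypothetical names, and the fixed `toy_lambda` is an assumption, since `PowerTransformer` estimates a separate parameter for each column from the data.

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value x for a given lambda."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            # lambda == 0: logarithm of (x + 1)
            return math.log1p(x)
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < 1e-12:
        # lambda == 2 on the negative branch: negative logarithm of (-x + 1)
        return -math.log1p(-x)
    return -(((-x + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)

# Hypothetical right-skewed values and an assumed lambda below 1
toy_lambda = 0.5
toy_values = [0.0, 1.0, 3.0, 8.0, 99.0]
toy_transformed = [yeo_johnson(v, toy_lambda) for v in toy_values]
```

With a parameter below 1, large positive values are compressed more strongly than small ones, which is how the transformation pulls in the long right tails seen in the skewness summary. A parameter of exactly 1 leaves the values unchanged.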
##################################
# Conducting standardization
# to transform the values of the
# variables into comparable scale
##################################
standardization_scaler = StandardScaler()
cancer_death_rate_transformed_numeric_array = standardization_scaler.fit_transform(cancer_death_rate_transformed_numeric)
##################################
# Formulating a new dataset object
# for the scaled data
##################################
cancer_death_rate_scaled_numeric = pd.DataFrame(cancer_death_rate_transformed_numeric_array,
columns=cancer_death_rate_transformed_numeric.columns)
##################################
# Formulating the individual boxplots
# for all scaled numeric columns
##################################
for column in cancer_death_rate_scaled_numeric:
    plt.figure(figsize=(17,1))
    sns.boxplot(data=cancer_death_rate_scaled_numeric, x=column)
##################################
# Consolidating both numeric columns
# and geolocation data
##################################
cancer_death_rate_preprocessed = pd.concat([cancer_death_rate_scaled_numeric,cancer_death_rate_cleaned_numeric_geolocation], axis=1, join='inner')
##################################
# Performing a general exploration of the consolidated dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_preprocessed.shape)
Dataset Dimensions:
(183, 14)
##################################
# Segregating the target
# and descriptor variable lists
##################################
cancer_death_rate_preprocessed_target_SMPREV = ['SMPREV']
cancer_death_rate_preprocessed_target_OWPREV = ['OWPREV']
cancer_death_rate_preprocessed_target_ACSHAR = ['ACSHAR']
cancer_death_rate_preprocessed_descriptors = cancer_death_rate_preprocessed.drop(['SMPREV','OWPREV','ACSHAR','GEOLAT','GEOLON'], axis=1).columns
##################################
# Segregating the target using SMPREV
# and descriptor variable names
##################################
y_variable = 'SMPREV'
x_variables = cancer_death_rate_preprocessed_descriptors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 3
num_cols = 3
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual scatterplots
# for all scaled numeric columns
##################################
for i, x_variable in enumerate(x_variables):
    ax = axes[i]
    ax.scatter(cancer_death_rate_preprocessed[x_variable],cancer_death_rate_preprocessed[y_variable])
    ax.set_title(f'{y_variable} Versus {x_variable}')
    ax.set_xlabel(x_variable)
    ax.set_ylabel(y_variable)
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
##################################
# Segregating the target using OWPREV
# and descriptor variable names
##################################
y_variable = 'OWPREV'
x_variables = cancer_death_rate_preprocessed_descriptors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 3
num_cols = 3
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual scatterplots
# for all scaled numeric columns
##################################
for i, x_variable in enumerate(x_variables):
    ax = axes[i]
    ax.scatter(cancer_death_rate_preprocessed[x_variable],cancer_death_rate_preprocessed[y_variable])
    ax.set_title(f'{y_variable} Versus {x_variable}')
    ax.set_xlabel(x_variable)
    ax.set_ylabel(y_variable)
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
##################################
# Segregating the target using ACSHAR
# and descriptor variable names
##################################
y_variable = 'ACSHAR'
x_variables = cancer_death_rate_preprocessed_descriptors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 3
num_cols = 3
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual scatterplots
# for all scaled numeric columns
##################################
for i, x_variable in enumerate(x_variables):
    ax = axes[i]
    ax.scatter(cancer_death_rate_preprocessed[x_variable],cancer_death_rate_preprocessed[y_variable])
    ax.set_title(f'{y_variable} Versus {x_variable}')
    ax.set_xlabel(x_variable)
    ax.set_ylabel(y_variable)
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
##################################
# Computing the correlation coefficients
# and correlation p-values
# between the target descriptor using SMPREV
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_correlation_target = {}
cancer_death_rate_preprocessed_numeric = cancer_death_rate_preprocessed.drop(['OWPREV','ACSHAR','GEOLAT','GEOLON'], axis=1)
cancer_death_rate_preprocessed_numeric_columns = cancer_death_rate_preprocessed_numeric.columns.tolist()
for numeric_column in cancer_death_rate_preprocessed_numeric_columns:
    cancer_death_rate_preprocessed_numeric_correlation_target['SMPREV_' + numeric_column] = stats.pearsonr(
        cancer_death_rate_preprocessed_numeric.loc[:, 'SMPREV'],
        cancer_death_rate_preprocessed_numeric.loc[:, numeric_column])
##################################
# Formulating the pairwise correlation summary
# between the target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_preprocessed_numeric_correlation_target, orient='index')
cancer_death_rate_preprocessed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_preprocessed_numeric_summary.sort_values(by=['Correlation.PValue'], ascending=True).head(10))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| SMPREV_SMPREV | 1.0000 | 0.0000 |
| SMPREV_LUNCAN | 0.6538 | 0.0000 |
| SMPREV_CERCAN | -0.4866 | 0.0000 |
| SMPREV_PROCAN | -0.4232 | 0.0000 |
| SMPREV_COLCAN | 0.4198 | 0.0000 |
| SMPREV_PANCAN | 0.3604 | 0.0000 |
| SMPREV_ESOCAN | -0.2655 | 0.0003 |
| SMPREV_STOCAN | -0.1196 | 0.1070 |
| SMPREV_LIVCAN | 0.1163 | 0.1171 |
| SMPREV_BRECAN | 0.0566 | 0.4465 |
##################################
# Computing the correlation coefficients
# and correlation p-values
# between the target descriptor using OWPREV
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_correlation_target = {}
cancer_death_rate_preprocessed_numeric = cancer_death_rate_preprocessed.drop(['SMPREV','ACSHAR','GEOLAT','GEOLON'], axis=1)
cancer_death_rate_preprocessed_numeric_columns = cancer_death_rate_preprocessed_numeric.columns.tolist()
for numeric_column in cancer_death_rate_preprocessed_numeric_columns:
    cancer_death_rate_preprocessed_numeric_correlation_target['OWPREV_' + numeric_column] = stats.pearsonr(
        cancer_death_rate_preprocessed_numeric.loc[:, 'OWPREV'],
        cancer_death_rate_preprocessed_numeric.loc[:, numeric_column])
##################################
# Formulating the pairwise correlation summary
# between the target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_preprocessed_numeric_correlation_target, orient='index')
cancer_death_rate_preprocessed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_preprocessed_numeric_summary.sort_values(by=['Correlation.PValue'], ascending=True).head(10))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| OWPREV_OWPREV | 1.0000 | 0.0000 |
| OWPREV_PANCAN | 0.5360 | 0.0000 |
| OWPREV_CERCAN | -0.4677 | 0.0000 |
| OWPREV_ESOCAN | -0.4489 | 0.0000 |
| OWPREV_LUNCAN | 0.4445 | 0.0000 |
| OWPREV_COLCAN | 0.4442 | 0.0000 |
| OWPREV_STOCAN | -0.1189 | 0.1088 |
| OWPREV_BRECAN | 0.0490 | 0.5105 |
| OWPREV_PROCAN | 0.0280 | 0.7072 |
| OWPREV_LIVCAN | -0.0214 | 0.7737 |
##################################
# Computing the correlation coefficients
# and correlation p-values
# between the target descriptor using ACSHAR
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_correlation_target = {}
cancer_death_rate_preprocessed_numeric = cancer_death_rate_preprocessed.drop(['SMPREV','OWPREV','GEOLAT','GEOLON'], axis=1)
cancer_death_rate_preprocessed_numeric_columns = cancer_death_rate_preprocessed_numeric.columns.tolist()
for numeric_column in cancer_death_rate_preprocessed_numeric_columns:
    cancer_death_rate_preprocessed_numeric_correlation_target['ACSHAR_' + numeric_column] = stats.pearsonr(
        cancer_death_rate_preprocessed_numeric.loc[:, 'ACSHAR'],
        cancer_death_rate_preprocessed_numeric.loc[:, numeric_column])
##################################
# Formulating the pairwise correlation summary
# between the target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_preprocessed_numeric_correlation_target, orient='index')
cancer_death_rate_preprocessed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_preprocessed_numeric_summary.sort_values(by=['Correlation.PValue'], ascending=True).head(10))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| ACSHAR_ACSHAR | 1.0000 | 0.0000 |
| ACSHAR_COLCAN | 0.6039 | 0.0000 |
| ACSHAR_PANCAN | 0.5929 | 0.0000 |
| ACSHAR_LUNCAN | 0.4403 | 0.0000 |
| ACSHAR_PROCAN | 0.2083 | 0.0047 |
| ACSHAR_BRECAN | 0.1759 | 0.0172 |
| ACSHAR_CERCAN | -0.1347 | 0.0690 |
| ACSHAR_STOCAN | -0.1249 | 0.0921 |
| ACSHAR_ESOCAN | 0.0732 | 0.3248 |
| ACSHAR_LIVCAN | -0.0709 | 0.3401 |
##################################
# Consolidating relevant numeric columns
# after hypothesis testing
##################################
cancer_death_rate_premodelling = cancer_death_rate_preprocessed.drop(['GEOLAT','GEOLON'], axis=1)
##################################
# Performing a general exploration of the premodelling dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_premodelling.shape)
Dataset Dimensions:
(183, 12)
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(cancer_death_rate_premodelling.dtypes)
Column Names and Data Types:
PROCAN    float64
BRECAN    float64
CERCAN    float64
STOCAN    float64
ESOCAN    float64
PANCAN    float64
LUNCAN    float64
COLCAN    float64
LIVCAN    float64
SMPREV    float64
OWPREV    float64
ACSHAR    float64
dtype: object
##################################
# Taking a snapshot of the dataset
##################################
cancer_death_rate_premodelling.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | -0.5405 | -1.4979 | -1.6782 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0.5329 | 0.6090 | 0.4008 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | -0.6438 | 0.9033 | -1.3345 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 1.1517 | 1.0213 | 1.1371 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | -1.0431 | -1.2574 | 0.3520 |
##################################
# Gathering the pairplot for all variables
##################################
sns.pairplot(cancer_death_rate_premodelling,
kind='reg',
plot_kws={'scatter_kws': {'alpha': 0.3}},)
plt.show()
##################################
# Preparing the clustering dataset
##################################
cancer_death_rate_premodelling_clustering = cancer_death_rate_premodelling.drop(['SMPREV','OWPREV','ACSHAR'], axis=1)
cancer_death_rate_premodelling_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 |
K-Means Clustering groups similar data points into clusters by minimizing the within-cluster sum of squared distances between the points and their cluster centers (the inertia). The algorithm iteratively partitions a data set into a fixed number k of non-overlapping clusters, wherein each data point belongs to the cluster with the nearest center (the cluster mean). The process begins by initializing a pre-defined number k of cluster centers. With every pass of the algorithm, each point is assigned to its nearest cluster center. Each cluster center is then updated by re-calculating it as the average of the points assigned to that cluster in that pass. The algorithm repeats until the cluster centers change only minimally from the previous iteration.
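The assignment and update passes described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the scikit-learn implementation used in this analysis: the function `kmeans`, its random choice of starting centers (instead of k-means++), and the `toy_points` data are all hypothetical.

```python
import math
import random

def kmeans(points, k, n_iter=100, tol=1e-9, seed=0):
    """Plain assign-then-update iteration; returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct observations picked at random
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest cluster center
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        # Stop once no center has moved by more than the tolerance
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels

# Hypothetical two-dimensional data with two well-separated groups
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centers, toy_labels = kmeans(toy_points, k=2)
```

On this toy data the first three points end up sharing one label and the last three the other. Which group receives label 0 depends on the random starting centers.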
Silhouette Score assesses the quality of clusters created by a clustering algorithm. It measures how well-separated the clusters are and how similar each data point is to the other points in its own cluster compared with the points in the nearest neighboring cluster. The silhouette score ranges from -1 to 1, where a higher value indicates better-defined clusters. For each data point, the silhouette score is the average dissimilarity to the points in the nearest neighboring cluster minus the average dissimilarity to the other points in its own cluster, divided by the larger of the two values. The overall silhouette score for the clustering is the average of the silhouette scores for all data points.
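The per-point computation described above can be written out directly. The following is a minimal standard-library sketch using Euclidean distance. The function `silhouette_values` and the toy data are hypothetical, and the value of 0 assigned to singleton clusters is a convention assumed here.

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b) with Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Hypothetical one-dimensional data with two obvious groups
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good_values = silhouette_values(toy_points, [0, 0, 1, 1])
bad_values = silhouette_values(toy_points, [0, 1, 0, 1])

# Overall score: the average of the per-point values
good_score = sum(good_values) / len(good_values)
bad_score = sum(bad_values) / len(bad_values)
```

For the point at 0 under the sensible labelling, a is 1 and b is 10.5, giving (10.5 - 1) / 10.5, roughly 0.905. Under the interleaved labelling the same point has a = 10 and b = 6, giving -0.4, so the overall score turns negative.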
##################################
# Fitting the K-Means Clustering algorithm
# using a range of K values
##################################
kmeans_cluster_list = list()
kmeans_cluster_inertia = list()
kmeans_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    km = KMeans(n_clusters=cluster_count,
                random_state=88888888,
                n_init='auto',
                init='k-means++')
    km = km.fit(cancer_death_rate_premodelling_clustering)
    kmeans_cluster_list.append(cluster_count)
    kmeans_cluster_inertia.append(km.inertia_)
    kmeans_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                            km.predict(cancer_death_rate_premodelling_clustering),
                                                            metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the K-Means Clustering algorithm
# using a range of K values
##################################
kmeans_clustering_evaluation_summary = pd.DataFrame(zip(kmeans_cluster_list,
kmeans_cluster_inertia,
kmeans_cluster_silhouette_score),
columns=['KMeans.Cluster.Count',
'KMeans.Cluster.Inertia',
'KMeans.Cluster.Silhouette.Score'])
kmeans_clustering_evaluation_summary
| | KMeans.Cluster.Count | KMeans.Cluster.Inertia | KMeans.Cluster.Silhouette.Score |
|---|---|---|---|
| 0 | 2 | 1238.4894 | 0.2355 |
| 1 | 3 | 1027.3347 | 0.2330 |
| 2 | 4 | 948.1192 | 0.2323 |
| 3 | 5 | 897.3084 | 0.1608 |
| 4 | 6 | 821.6682 | 0.1576 |
| 5 | 7 | 771.4820 | 0.1627 |
| 6 | 8 | 725.5394 | 0.1633 |
| 7 | 9 | 670.6289 | 0.1836 |
###################################
# Plotting the Inertia performance
# by cluster count using a range of K values
# for the K-Means Clustering algorithm
##################################
kmeans_cluster_count_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Count'].values)
kmeans_inertia_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Inertia'].values)
plt.figure(figsize=(10, 6))
plt.plot(kmeans_cluster_count_values, kmeans_inertia_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(500,1500)
plt.title("K-Means Clustering Algorithm: Cluster Count by Inertia")
plt.xlabel("Cluster")
plt.ylabel("Inertia")
plt.show()
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the K-Means Clustering algorithm
##################################
kmeans_cluster_count_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Count'].values)
kmeans_silhouette_score_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(kmeans_cluster_count_values, kmeans_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("K-Means Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final K-Means Clustering model
# using the optimal cluster count
##################################
kmeans_clustering = KMeans(n_clusters=2,
random_state=88888888,
n_init='auto',
init='k-means++')
kmeans_clustering = kmeans_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Inertia and Silhouette Score
# for the final K-Means Clustering model
##################################
kmeans_clustering_inertia = kmeans_clustering.inertia_
kmeans_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering,
kmeans_clustering.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean')
##################################
# Plotting the cluster labels
# for the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_kmeans_clustering['KMEANS_CLUSTER'] = kmeans_clustering.predict(cancer_death_rate_kmeans_clustering)
cancer_death_rate_kmeans_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | KMEANS_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Gathering the pairplot for all variables
# labelled using the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering_plot = sns.pairplot(cancer_death_rate_kmeans_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_kmeans_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='KMEANS_CLUSTER', frameon=False)
plt.show()
Bisecting K-Means Clustering is a variant of the traditional K-Means algorithm that repeatedly splits clusters in two until the desired number of clusters is reached. It is a hierarchical clustering approach that uses a divisive strategy to build a hierarchy of clusters. The algorithm starts with the entire dataset as a single cluster. The standard K-Means algorithm with two centroids is applied to the selected cluster, splitting it into two sub-clusters. The selection and splitting steps are repeated until the desired number of clusters is reached. When multiple clusters are present, the algorithm selects the cluster with the largest within-cluster variance (inertia) for the next split. This results in a hierarchical structure of clusters, and the process can be stopped at any desired level of granularity.
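The divisive procedure described above can be sketched in dependency-free Python. This is a minimal illustration under stated assumptions, not the scikit-learn `BisectingKMeans` implementation used in this analysis: the helper names (`two_means`, `bisecting_kmeans`), the toy points, and the deterministic seeding (used here in place of k-means++) are all hypothetical choices made for the example.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def sse(points):
    """Within-cluster sum of squared distances to the centroid (inertia)."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)

def two_means(points, n_iter=20):
    """Split one cluster in two using plain Lloyd iterations."""
    # Deterministic seeding (an assumption of this sketch, not k-means++):
    # the point farthest from the centroid, then the point farthest from that seed.
    c = centroid(points)
    seed_a = max(points, key=lambda p: sq_dist(p, c))
    seed_b = max(points, key=lambda p: sq_dist(p, seed_a))
    centers = [seed_a, seed_b]
    halves = None
    for _ in range(n_iter):
        new_halves = ([], [])
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            new_halves[nearest].append(p)
        if not new_halves[0] or not new_halves[1]:
            break  # keep the last valid split, if any
        halves = new_halves
        centers = [centroid(halves[0]), centroid(halves[1])]
    if halves is None:
        return [list(points)]  # the cluster could not be split
    return [halves[0], halves[1]]

def bisecting_kmeans(points, n_clusters):
    clusters = [list(points)]  # start with the whole data set as one cluster
    while len(clusters) < n_clusters:
        # Select the cluster with the largest within-cluster sum of squares.
        target = max(clusters, key=sse)
        parts = two_means(target)
        if len(parts) < 2:
            break  # nothing left to split
        clusters.remove(target)
        clusters.extend(parts)
    return clusters

toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0),
              (20.0, 0.0), (20.0, 1.0)]
toy_clusters = bisecting_kmeans(toy_points, n_clusters=3)
for cluster in toy_clusters:
    print(cluster)
```

Because each split only ever divides an existing cluster, stopping the loop earlier yields a coarser level of the same hierarchy.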
Silhouette Score assesses the quality of the clusters created by a clustering algorithm. It measures how well-separated the clusters are and how similar each data point is to the other points in its own cluster compared to the points in the nearest neighboring cluster. The silhouette score ranges from -1 to 1, where a higher value indicates better-defined clusters. The silhouette method computes a score for each data point as the average dissimilarity of the data point to all points in the nearest neighboring cluster, minus the average dissimilarity of the data point to the other points in its own cluster, divided by the larger of the two values. The overall silhouette score for the clustering is the average of the silhouette scores for all data points.
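The per-point formula above, s = (b - a) / max(a, b), can be written out directly. The following is a minimal sketch using only the standard library; it is meant to make the definition concrete and is not the `silhouette_score` function from scikit-learn that the analysis calls. The function names, the toy points, and the convention of scoring singleton clusters as 0 are assumptions of this example.

```python
import math

def silhouette_samples_manual(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b) with Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, grouped by cluster label.
        dists_by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                dists_by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists_by_cluster.pop(labels[i], None)
        if own is None or not dists_by_cluster:
            # Singleton cluster or a single cluster overall:
            # scored as 0 here (an assumption of this sketch).
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # cohesion: mean distance within the own cluster
        b = min(sum(d) / len(d) for d in dists_by_cluster.values())  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

def silhouette_manual(points, labels):
    """Overall score: the mean of the per-point silhouette values."""
    samples = silhouette_samples_manual(points, labels)
    return sum(samples) / len(samples)

toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_score = silhouette_manual(toy_points, [0, 0, 1, 1])  # compact, separated clusters
bad_score = silhouette_manual(toy_points, [0, 1, 0, 1])   # clusters that cut across the gap
print(round(good_score, 4), round(bad_score, 4))
```

The first labelling groups nearby points together and scores close to 1, while the second mixes distant points in each cluster and scores below 0.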
##################################
# Fitting the Bisecting K-Means Clustering algorithm
# using a range of K values
##################################
bisecting_kmeans_cluster_list = list()
bisecting_kmeans_cluster_inertia = list()
bisecting_kmeans_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    bk = BisectingKMeans(n_clusters=cluster_count,
                         random_state=88888888,
                         n_init=1,
                         init='k-means++')
    bk = bk.fit(cancer_death_rate_premodelling_clustering)
    bisecting_kmeans_cluster_list.append(cluster_count)
    bisecting_kmeans_cluster_inertia.append(bk.inertia_)
    bisecting_kmeans_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                      bk.predict(cancer_death_rate_premodelling_clustering),
                                                                      metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the Bisecting K-Means Clustering algorithm
# using a range of K values
##################################
bisecting_kmeans_clustering_evaluation_summary = pd.DataFrame(zip(bisecting_kmeans_cluster_list,
bisecting_kmeans_cluster_inertia,
bisecting_kmeans_cluster_silhouette_score),
columns=['Bisecting.KMeans.Cluster.Count',
'Bisecting.KMeans.Cluster.Inertia',
'Bisecting.KMeans.Cluster.Silhouette.Score'])
bisecting_kmeans_clustering_evaluation_summary
| | Bisecting.KMeans.Cluster.Count | Bisecting.KMeans.Cluster.Inertia | Bisecting.KMeans.Cluster.Silhouette.Score |
|---|---|---|---|
| 0 | 2 | 1238.4894 | 0.2355 |
| 1 | 3 | 1080.6399 | 0.2146 |
| 2 | 4 | 955.1301 | 0.1887 |
| 3 | 5 | 891.9650 | 0.1762 |
| 4 | 6 | 843.0145 | 0.1750 |
| 5 | 7 | 798.7791 | 0.1341 |
| 6 | 8 | 758.0470 | 0.1413 |
| 7 | 9 | 714.1712 | 0.1503 |
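The optimal cluster count used for the final model can be read off a summary like the one above by taking the row with the highest silhouette score. A minimal sketch, assuming the values are copied from the table into plain lists (the variable names are hypothetical):

```python
# Cluster counts and silhouette scores copied from the summary table above.
cluster_counts = [2, 3, 4, 5, 6, 7, 8, 9]
silhouette_scores = [0.2355, 0.2146, 0.1887, 0.1762, 0.1750, 0.1341, 0.1413, 0.1503]

# Pick the (count, score) pair with the largest silhouette score.
best_count, best_score = max(zip(cluster_counts, silhouette_scores),
                             key=lambda pair: pair[1])
print(best_count, best_score)
```

With a pandas summary frame, the same selection could be made with `idxmax` on the silhouette score column.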
###################################
# Plotting the Inertia performance
# by cluster count using a range of K values
# for the Bisecting K-Means Clustering algorithm
##################################
bisecting_kmeans_cluster_count_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Count'].values)
bisecting_kmeans_inertia_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Inertia'].values)
plt.figure(figsize=(10, 6))
plt.plot(bisecting_kmeans_cluster_count_values, bisecting_kmeans_inertia_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(500,1500)
plt.title("Bisecting K-Means Clustering Algorithm: Cluster Count by Inertia")
plt.xlabel("Cluster")
plt.ylabel("Inertia")
plt.show()
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the Bisecting K-Means Clustering algorithm
##################################
bisecting_kmeans_cluster_count_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Count'].values)
bisecting_kmeans_silhouette_score_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(bisecting_kmeans_cluster_count_values, bisecting_kmeans_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("Bisecting K-Means Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final Bisecting K-Means Clustering model
# using the optimal cluster count
##################################
bisecting_kmeans_clustering = BisectingKMeans(n_clusters=2,
random_state=88888888,
n_init=1,
init='k-means++')
bisecting_kmeans_clustering = bisecting_kmeans_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Inertia and Silhouette Score
# for the final Bisecting K-Means Clustering model
##################################
bisecting_kmeans_clustering_inertia = bisecting_kmeans_clustering.inertia_
bisecting_kmeans_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering,
bisecting_kmeans_clustering.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean')
##################################
# Plotting the cluster labels
# for the final Bisecting K-Means Clustering model
##################################
cancer_death_rate_bisecting_kmeans_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_bisecting_kmeans_clustering['BISECTING_KMEANS_CLUSTER'] = bisecting_kmeans_clustering.predict(cancer_death_rate_bisecting_kmeans_clustering)
cancer_death_rate_bisecting_kmeans_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | BISECTING_KMEANS_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Gathering the pairplot for all variables
# labelled using the final Bisecting K-Means Clustering model
##################################
cancer_death_rate_bisecting_kmeans_clustering_plot = sns.pairplot(cancer_death_rate_bisecting_kmeans_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='BISECTING_KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_bisecting_kmeans_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='BISECTING_KMEANS_CLUSTER', frameon=False)
plt.show()
Gaussian Mixture Clustering is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. It incorporates information about the covariance structure of the data as well as the centers of the latent Gaussians. The algorithm initializes the parameters of the Gaussian components, using K-Means clustering to obtain initial estimates for the means and the identity matrix as a starting point for the covariance matrices. The expectation-maximization process is then applied: the expectation step calculates the probability of each data point belonging to each Gaussian component using Bayes' theorem, and the maximization step updates the parameters of the Gaussian components using the data points weighted by those probabilities. Convergence is checked by evaluating whether the log-likelihood of the data has stabilized or reached a maximum. Both steps are iterated until this criterion is met. After convergence, each data point is assigned to the cluster with the highest probability.
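The expectation-maximization loop described above can be illustrated on one-dimensional data. This is a simplified sketch, not the scikit-learn `GaussianMixture` implementation used in the analysis: the function name `em_gmm_1d`, the toy data, the hand-picked starting means, the unit starting variances (the 1-D analogue of an identity covariance), and the small variance floor are all assumptions made for the example.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, init_means, n_iter=50, tol=1e-6):
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k      # unit variances as the starting point
    weights = [1.0 / k] * k    # equal mixing weights to begin with
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: responsibilities from Bayes' theorem, plus the log-likelihood.
        resp = []
        ll = 0.0
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        # M-step: responsibility-weighted updates of weights, means and variances.
        for c in range(k):
            nk = sum(r[c] for r in resp)
            weights[c] = nk / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nk
            weighted_sq = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data))
            variances[c] = max(weighted_sq / nk, 1e-6)  # floor avoids a zero variance
        # Convergence check: stop once the log-likelihood has stabilized.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    # Hard assignment from the responsibilities of the last E-step.
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return weights, means, variances, labels

toy_data = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
gm_weights, gm_means, gm_variances, gm_labels = em_gmm_1d(toy_data, init_means=[-1.0, 1.0])
print(gm_means, gm_labels)
```

The full multivariate case replaces the scalar variances with covariance matrices, which is what the `covariance_type='full'` setting refers to.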
Silhouette Score assesses the quality of the clusters created by a clustering algorithm. It measures how well-separated the clusters are and how similar each data point is to the other points in its own cluster compared to the points in the nearest neighboring cluster. The silhouette score ranges from -1 to 1, where a higher value indicates better-defined clusters. The silhouette method computes a score for each data point as the average dissimilarity of the data point to all points in the nearest neighboring cluster, minus the average dissimilarity of the data point to the other points in its own cluster, divided by the larger of the two values. The overall silhouette score for the clustering is the average of the silhouette scores for all data points.
##################################
# Fitting the GMM Clustering algorithm
# using a range of K values
##################################
gaussian_mixture_cluster_list = list()
gaussian_mixture_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    gm = GaussianMixture(n_components=cluster_count,
                         init_params='k-means++',
                         covariance_type='full',
                         tol=1e-3,
                         random_state=88888888)
    gm = gm.fit(cancer_death_rate_premodelling_clustering)
    gaussian_mixture_cluster_list.append(cluster_count)
    gaussian_mixture_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                      gm.predict(cancer_death_rate_premodelling_clustering),
                                                                      metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the GMM Clustering algorithm
# using a range of K values
##################################
gaussian_mixture_clustering_evaluation_summary = pd.DataFrame(zip(gaussian_mixture_cluster_list,
gaussian_mixture_cluster_silhouette_score),
columns=['GMM.Cluster.Count',
'GMM.Cluster.Silhouette.Score'])
gaussian_mixture_clustering_evaluation_summary
| | GMM.Cluster.Count | GMM.Cluster.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2239 |
| 1 | 3 | 0.2235 |
| 2 | 4 | 0.2026 |
| 3 | 5 | 0.1205 |
| 4 | 6 | 0.1208 |
| 5 | 7 | 0.1266 |
| 6 | 8 | 0.1320 |
| 7 | 9 | 0.1348 |
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the GMM Clustering algorithm
##################################
gaussian_mixture_cluster_count_values = np.array(gaussian_mixture_clustering_evaluation_summary['GMM.Cluster.Count'].values)
gaussian_mixture_silhouette_score_values = np.array(gaussian_mixture_clustering_evaluation_summary['GMM.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(gaussian_mixture_cluster_count_values, gaussian_mixture_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("GMM Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final GMM Clustering model
# using the optimal cluster count
##################################
gaussian_mixture_clustering = GaussianMixture(n_components=2,
init_params='k-means++',
random_state=88888888)
gaussian_mixture_clustering = gaussian_mixture_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Silhouette Score
# for the final GMM Clustering model
##################################
gaussian_mixture_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering,
gaussian_mixture_clustering.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean')
##################################
# Plotting the cluster labels
# for the final GMM Clustering model
##################################
cancer_death_rate_gaussian_mixture_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_gaussian_mixture_clustering['GMM_CLUSTER'] = gaussian_mixture_clustering.predict(cancer_death_rate_gaussian_mixture_clustering)
cancer_death_rate_gaussian_mixture_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | GMM_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Gathering the pairplot for all variables
# labelled using the final GMM Clustering model
##################################
cancer_death_rate_gaussian_mixture_clustering_plot = sns.pairplot(cancer_death_rate_gaussian_mixture_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='GMM_CLUSTER');
sns.move_legend(cancer_death_rate_gaussian_mixture_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='GMM_CLUSTER', frameon=False)
plt.show()
Agglomerative Clustering builds a hierarchy of clusters from the bottom up. Each data point starts as its own singleton cluster, so the number of initial clusters equals the number of data points, and the algorithm merges clusters iteratively until a stopping criterion is met. The pairwise distance matrix is calculated between all clusters using complete linkage, defined as the maximum distance between any two points in the two clusters. The two clusters with the minimum distance according to the linkage criterion are identified and merged. The distances between the new cluster and all other clusters are then recalculated. These steps are repeated until the desired number of clusters is reached or until a stopping criterion is met.
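The merge loop described above can be sketched naively in standard-library Python. This is an illustration of complete linkage only, recomputing all cluster distances at every step, and is not how scikit-learn's `AgglomerativeClustering` is implemented; the function names and the toy points are assumptions of this example.

```python
import math
from itertools import combinations

def complete_linkage_distance(cluster_a, cluster_b):
    """Maximum Euclidean distance between any point in A and any point in B."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerate_complete(points, n_clusters):
    clusters = [[p] for p in points]  # every point starts as a singleton cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: complete_linkage_distance(clusters[ij[0]],
                                                            clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

toy_points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0), (10.0, 0.0)]
toy_clusters = agglomerate_complete(toy_points, n_clusters=2)
print(toy_clusters)
```

Because complete linkage scores a candidate merge by its two farthest members, it tends to favor compact clusters and leaves the distant point at `(10.0, 0.0)` on its own here.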
Silhouette Score assesses the quality of the clusters created by a clustering algorithm. It measures how well-separated the clusters are and how similar each data point is to the other points in its own cluster compared to the points in the nearest neighboring cluster. The silhouette score ranges from -1 to 1, where a higher value indicates better-defined clusters. The silhouette method computes a score for each data point as the average dissimilarity of the data point to all points in the nearest neighboring cluster, minus the average dissimilarity of the data point to the other points in its own cluster, divided by the larger of the two values. The overall silhouette score for the clustering is the average of the silhouette scores for all data points.
##################################
# Fitting the Agglomerative Clustering algorithm
# using a range of K values
##################################
agglomerative_cluster_list = list()
agglomerative_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    ag = AgglomerativeClustering(n_clusters=cluster_count,
                                 linkage='complete')
    ag = ag.fit(cancer_death_rate_premodelling_clustering)
    agglomerative_cluster_list.append(cluster_count)
    agglomerative_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                   ag.fit_predict(cancer_death_rate_premodelling_clustering),
                                                                   metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the Agglomerative Clustering algorithm
# using a range of K values
##################################
agglomerative_clustering_evaluation_summary = pd.DataFrame(zip(agglomerative_cluster_list,
agglomerative_cluster_silhouette_score),
columns=['Agglomerative.Cluster.Count',
'Agglomerative.Cluster.Silhouette.Score'])
agglomerative_clustering_evaluation_summary
| | Agglomerative.Cluster.Count | Agglomerative.Cluster.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.1629 |
| 1 | 3 | 0.1311 |
| 2 | 4 | 0.1127 |
| 3 | 5 | 0.1617 |
| 4 | 6 | 0.2035 |
| 5 | 7 | 0.1995 |
| 6 | 8 | 0.2006 |
| 7 | 9 | 0.1968 |
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the Agglomerative Clustering algorithm
##################################
agglomerative_cluster_count_values = np.array(agglomerative_clustering_evaluation_summary['Agglomerative.Cluster.Count'].values)
agglomerative_silhouette_score_values = np.array(agglomerative_clustering_evaluation_summary['Agglomerative.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(agglomerative_cluster_count_values, agglomerative_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("Agglomerative Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final Agglomerative Clustering model
# using the optimal cluster count
##################################
agglomerative_clustering = AgglomerativeClustering(n_clusters=2,
linkage='complete')
agglomerative_clustering = agglomerative_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Silhouette Score
# for the final Agglomerative Clustering model
##################################
agglomerative_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering, agglomerative_clustering.labels_, metric='euclidean')
##################################
# Plotting the cluster labels
# for the final Agglomerative Clustering model
##################################
cancer_death_rate_agglomerative_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_agglomerative_clustering['AGGLOMERATIVE_CLUSTER'] = agglomerative_clustering.fit_predict(cancer_death_rate_agglomerative_clustering)
cancer_death_rate_agglomerative_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | AGGLOMERATIVE_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 1 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 1 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 0 |
##################################
# Gathering the pairplot for all variables
# labelled using the final Agglomerative Clustering model
##################################
cancer_death_rate_agglomerative_clustering_plot = sns.pairplot(cancer_death_rate_agglomerative_clustering,
kind='reg',
markers=['o', 's'],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='AGGLOMERATIVE_CLUSTER');
sns.move_legend(cancer_death_rate_agglomerative_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='AGGLOMERATIVE_CLUSTER', frameon=False)
plt.show()
Ward Hierarchical Clustering creates compact, well-separated clusters by minimizing the variance within each cluster during the merging process. Each data point starts as its own singleton cluster, so the number of initial clusters equals the number of data points, and the algorithm merges clusters iteratively until a stopping criterion is met. The pairwise distance matrix is calculated between all clusters and used as a measure of dissimilarity. For each cluster, the within-cluster variance is computed, which evaluates how tightly the data points within a cluster are grouped. The two clusters whose merger results in the smallest increase in the within-cluster variance are identified and merged. The within-cluster variance for the newly formed cluster is recalculated and the pairwise distance matrix is updated. These steps are repeated until the desired number of clusters is reached or until a stopping criterion is met.
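The merge criterion described above has a convenient closed form: merging clusters A and B increases the total within-cluster sum of squares by |A||B| / (|A| + |B|) times the squared distance between their centroids. The sketch below checks that identity against a direct computation and picks the cheapest merge among three toy clusters. The helper names and data are assumptions of this example, and it is not the scikit-learn implementation behind `linkage='ward'`.

```python
from itertools import combinations

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def ward_merge_cost(cluster_a, cluster_b):
    """Increase in total within-cluster SSE caused by merging A and B."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    n_a, n_b = len(cluster_a), len(cluster_b)
    return n_a * n_b / (n_a + n_b) * sum((x - y) ** 2 for x, y in zip(ca, cb))

toy_clusters = {
    "a": [(0.0, 0.0), (0.0, 2.0)],
    "b": [(3.0, 1.0)],
    "c": [(10.0, 0.0), (10.0, 2.0)],
}

# Ward merges the pair whose union adds the least within-cluster SSE.
merge_costs = {
    (u, v): ward_merge_cost(toy_clusters[u], toy_clusters[v])
    for u, v in combinations(sorted(toy_clusters), 2)
}
best_pair = min(merge_costs, key=merge_costs.get)

# Direct check of the closed form for the pair ("a", "b").
direct_increase = (sse(toy_clusters["a"] + toy_clusters["b"])
                   - sse(toy_clusters["a"]) - sse(toy_clusters["b"]))
print(best_pair, merge_costs[best_pair], direct_increase)
```

Unlike complete linkage, this cost depends on cluster sizes as well as distances, which is why Ward linkage tends to produce clusters of similar spread.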
Silhouette Score assesses the quality of the clusters created by a clustering algorithm. It measures how well-separated the clusters are and how similar each data point is to the other points in its own cluster compared to the points in the nearest neighboring cluster. The silhouette score ranges from -1 to 1, where a higher value indicates better-defined clusters. The silhouette method computes a score for each data point as the average dissimilarity of the data point to all points in the nearest neighboring cluster, minus the average dissimilarity of the data point to the other points in its own cluster, divided by the larger of the two values. The overall silhouette score for the clustering is the average of the silhouette scores for all data points.
##################################
# Fitting the Ward Hierarchical Clustering algorithm
# using a range of K values
##################################
ward_hierarchical_cluster_list = list()
ward_hierarchical_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    wh = AgglomerativeClustering(n_clusters=cluster_count,
                                 linkage='ward')
    wh = wh.fit(cancer_death_rate_premodelling_clustering)
    ward_hierarchical_cluster_list.append(cluster_count)
    ward_hierarchical_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                       wh.fit_predict(cancer_death_rate_premodelling_clustering),
                                                                       metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the Ward Hierarchical Clustering algorithm
# using a range of K values
##################################
ward_hierarchical_clustering_evaluation_summary = pd.DataFrame(zip(ward_hierarchical_cluster_list,
ward_hierarchical_cluster_silhouette_score),
columns=['Ward.Hierarchical.Cluster.Count',
'Ward.Hierarchical.Cluster.Silhouette.Score'])
ward_hierarchical_clustering_evaluation_summary
| | Ward.Hierarchical.Cluster.Count | Ward.Hierarchical.Cluster.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2148 |
| 1 | 3 | 0.1924 |
| 2 | 4 | 0.1840 |
| 3 | 5 | 0.1714 |
| 4 | 6 | 0.1858 |
| 5 | 7 | 0.1803 |
| 6 | 8 | 0.1595 |
| 7 | 9 | 0.1689 |
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the Ward Hierarchical Clustering algorithm
##################################
ward_hierarchical_cluster_count_values = np.array(ward_hierarchical_clustering_evaluation_summary['Ward.Hierarchical.Cluster.Count'].values)
ward_hierarchical_silhouette_score_values = np.array(ward_hierarchical_clustering_evaluation_summary['Ward.Hierarchical.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(ward_hierarchical_cluster_count_values, ward_hierarchical_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("Ward Hierarchical Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final Ward Hierarchical Clustering model
# using the optimal cluster count
##################################
ward_hierarchical_clustering = AgglomerativeClustering(n_clusters=2,
linkage='ward')
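# Fitting the Ward-linkage model itself; fitting agglomerative_clustering here
# would silently reuse the complete-linkage model and its silhouette score
ward_hierarchical_clustering = ward_hierarchical_clustering.fit(cancer_death_rate_premodelling_clustering)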
###################################
# Gathering the Silhouette Score
# for the final Ward Hierarchical model
##################################
ward_hierarchical_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering, ward_hierarchical_clustering.labels_, metric='euclidean')
##################################
# Plotting the cluster labels
# for the final Ward Hierarchical Clustering model
##################################
cancer_death_rate_ward_hierarchical_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_ward_hierarchical_clustering['WARD_HIERARCHICAL_CLUSTER'] = ward_hierarchical_clustering.fit_predict(cancer_death_rate_ward_hierarchical_clustering)
cancer_death_rate_ward_hierarchical_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | WARD_HIERARCHICAL_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 1 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 1 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 0 |
##################################
# Gathering the pairplot for all variables
# labelled using the final Ward Hierarchical Clustering model
##################################
cancer_death_rate_ward_hierarchical_clustering_plot = sns.pairplot(cancer_death_rate_ward_hierarchical_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='WARD_HIERARCHICAL_CLUSTER');
sns.move_legend(cancer_death_rate_ward_hierarchical_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='WARD_HIERARCHICAL_CLUSTER', frameon=False)
plt.show()
##################################
# Consolidating all the
# model performance measures
##################################
clustering_silhouette_score_list = [kmeans_clustering_silhouette_score,
bisecting_kmeans_clustering_silhouette_score,
gaussian_mixture_clustering_silhouette_score,
agglomerative_clustering_silhouette_score,
ward_hierarchical_clustering_silhouette_score]
clustering_silhouette_algorithm_list = ['kmeans_clustering',
'bisecting_kmeans_clustering',
'gaussian_mixture_clustering',
'agglomerative_clustering',
'ward_hierarchical_clustering']
performance_comparison_silhouette_score = pd.DataFrame(zip(clustering_silhouette_algorithm_list,
clustering_silhouette_score_list),
columns=['Clustering.Algorithm',
'Silhouette.Score'])
print('Consolidated Model Performance: ')
display(performance_comparison_silhouette_score)
Consolidated Model Performance:
| | Clustering.Algorithm | Silhouette.Score |
|---|---|---|
| 0 | kmeans_clustering | 0.2355 |
| 1 | bisecting_kmeans_clustering | 0.2355 |
| 2 | gaussian_mixture_clustering | 0.2239 |
| 3 | agglomerative_clustering | 0.1629 |
| 4 | ward_hierarchical_clustering | 0.1629 |
##################################
# Plotting all the Silhouette Score
# model performance measures
##################################
performance_comparison_silhouette_score.set_index('Clustering.Algorithm', inplace=True)
performance_comparison_silhouette_score_plot = performance_comparison_silhouette_score.plot.barh(figsize=(10, 6))
performance_comparison_silhouette_score_plot.set_xlim(0.00,1.00)
performance_comparison_silhouette_score_plot.set_title("Model Comparison by Silhouette Score Performance for Number of Clusters=2")
performance_comparison_silhouette_score_plot.set_xlabel("Silhouette Score Performance")
performance_comparison_silhouette_score_plot.set_ylabel("Clustering Model")
performance_comparison_silhouette_score_plot.grid(False)
performance_comparison_silhouette_score_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in performance_comparison_silhouette_score_plot.containers:
    performance_comparison_silhouette_score_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
##################################
# Exploring the selected final model
# using the clustering descriptors
# and K-Means clusters
##################################
cancer_death_rate_kmeans_clustering_descriptor = cancer_death_rate_kmeans_clustering.copy()
cancer_death_rate_kmeans_clustering_descriptor.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | KMEANS_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Gathering the pairplot for all variables
# labelled using the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering_descriptor_plot = sns.pairplot(cancer_death_rate_kmeans_clustering_descriptor,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_kmeans_clustering_descriptor_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Computing the average descriptors
# for each K-Means Cluster
##################################
cancer_death_rate_kmeans_clustering_descriptor['KMEANS_CLUSTER'] = np.where(cancer_death_rate_kmeans_clustering_descriptor['KMEANS_CLUSTER']== 0,'HIGH_PAN_LUN_COL_LIV_CAN','HIGH_PRO_BRE_CER_STO_ESO_CAN')
cancer_death_rate_kmeans_descriptor_clustered = cancer_death_rate_kmeans_clustering_descriptor.groupby('KMEANS_CLUSTER').mean()
display(cancer_death_rate_kmeans_descriptor_clustered)
| KMEANS_CLUSTER | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN |
|---|---|---|---|---|---|---|---|---|---|
| HIGH_PAN_LUN_COL_LIV_CAN | -0.4004 | -0.0894 | -0.7876 | -0.4930 | -0.4541 | 0.6040 | 0.7054 | 0.6445 | 0.0465 |
| HIGH_PRO_BRE_CER_STO_ESO_CAN | 0.3550 | 0.0793 | 0.6983 | 0.4371 | 0.4026 | -0.5355 | -0.6254 | -0.5714 | -0.0413 |
##################################
# Plotting the heatmap of the average
# clustering descriptors
# for each K-Means Cluster
##################################
plt.figure(figsize=(10, 8))
sns.heatmap(cancer_death_rate_kmeans_descriptor_clustered, annot=True, cmap="seismic")
plt.xlabel('Cancer Types')
plt.ylabel('K-Means Clusters')
plt.title('Heatmap of Death Rates by Cancer Type and K-Means Clusters')
plt.show()
##################################
# Exploring the selected final model
# using the target descriptors
# and K-Means clusters
##################################
cancer_death_rate_kmeans_clustering_target = pd.concat([cancer_death_rate_kmeans_clustering[['KMEANS_CLUSTER']],cancer_death_rate_preprocessed[['SMPREV','OWPREV','ACSHAR']]], axis=1, join='inner')
cancer_death_rate_kmeans_clustering_target['KMEANS_CLUSTER'] = np.where(cancer_death_rate_kmeans_clustering_target['KMEANS_CLUSTER']== 0,'HIGH_PAN_LUN_COL_LIV_CAN','HIGH_PRO_BRE_CER_STO_ESO_CAN')
cancer_death_rate_kmeans_clustering_target.head()
| | KMEANS_CLUSTER | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|---|
| 0 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -0.5405 | -1.4979 | -1.6782 |
| 1 | HIGH_PAN_LUN_COL_LIV_CAN | 0.5329 | 0.6090 | 0.4008 |
| 2 | HIGH_PAN_LUN_COL_LIV_CAN | -0.6438 | 0.9033 | -1.3345 |
| 3 | HIGH_PAN_LUN_COL_LIV_CAN | 1.1517 | 1.0213 | 1.1371 |
| 4 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -1.0431 | -1.2574 | 0.3520 |
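The `pd.concat(..., axis=1, join='inner')` call above pairs rows by index label, not by position, so the result is only meaningful when both frames still share the same index. The minimal sketch below uses hypothetical toy frames (`clusters`, `factors`, and their values are invented for illustration) to show what the inner join keeps and how a reset index can silently pair the wrong rows.

```python
import pandas as pd

# Hypothetical toy frames; names and values are for illustration only.
clusters = pd.DataFrame({"KMEANS_CLUSTER": [0, 1, 0]}, index=[0, 1, 2])

# Suppose the row labelled 1 was dropped from the second frame earlier on.
factors = pd.DataFrame({"SMPREV": [-0.5, 1.2]}, index=[0, 2])

# axis=1 with join='inner' keeps only index labels present in both frames.
combined = pd.concat([clusters, factors], axis=1, join="inner")
print(combined)

# If one frame has its index reset, the labels no longer refer to the same rows.
factors_reset = factors.reset_index(drop=True)  # index becomes [0, 1]
misaligned = pd.concat([clusters, factors_reset], axis=1, join="inner")
print(misaligned)
```

In the second result, the value 1.2 that belonged to the row labelled 2 is paired with the cluster of the row labelled 1, which is why it can be worth comparing the indices of the frames before concatenating them.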
##################################
# Computing the average target descriptors
# for each K-Means Cluster
##################################
cancer_death_rate_kmeans_target_clustered = cancer_death_rate_kmeans_clustering_target.groupby('KMEANS_CLUSTER').mean()
display(cancer_death_rate_kmeans_target_clustered)
| KMEANS_CLUSTER | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|
| HIGH_PAN_LUN_COL_LIV_CAN | 0.6433 | 0.4329 | 0.3218 |
| HIGH_PRO_BRE_CER_STO_ESO_CAN | -0.5704 | -0.3838 | -0.2853 |
##################################
# Plotting the heatmap of the
# average target descriptors
# for each K-Means Cluster
##################################
plt.figure(figsize=(10, 8))
sns.heatmap(cancer_death_rate_kmeans_target_clustered, annot=True, cmap="seismic")
plt.xlabel('Lifestyle Factors')
plt.ylabel('K-Means Clusters')
plt.title('Heatmap of Lifestyle Factors and K-Means Clusters')
plt.show()
##################################
# Exploring the selected final model
# using the location data
# and K-Means clusters
##################################
cancer_death_rate_kmeans_cluster_map = pd.concat([cancer_death_rate_kmeans_clustering_target[['KMEANS_CLUSTER']],cancer_death_rate_filtered_row[['CODE']]], axis=1, join='inner')
cancer_death_rate_kmeans_cluster_map.head()
| | KMEANS_CLUSTER | CODE |
|---|---|---|
| 0 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AFG |
| 1 | HIGH_PAN_LUN_COL_LIV_CAN | ALB |
| 2 | HIGH_PAN_LUN_COL_LIV_CAN | DZA |
| 3 | HIGH_PAN_LUN_COL_LIV_CAN | AND |
| 4 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AGO |
##################################
# Loading the world map GeoJSON file
# obtained from https://geojson-maps.ash.ms/
##################################
world = gpd.read_file('custom.geo.json')
##################################
# Merging the world map GeoDataFrame
# with the K-Means cluster labels using country codes
##################################
world_cluster = world.merge(cancer_death_rate_kmeans_cluster_map, left_on='gu_a3', right_on='CODE', how='left')
##################################
# Plotting the map by K-Means cluster
##################################
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
world_cluster.boundary.plot(ax=ax, linewidth=1)
world_cluster.plot(column='KMEANS_CLUSTER', cmap="seismic", legend=True, ax=ax, legend_kwds={"loc": "center left", "bbox_to_anchor": (1, 0.5)})
plt.title('KMEANS_CLUSTER')
plt.show()
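Because the merge above uses `how='left'`, every polygon in the world map is kept and any country without a matching `CODE` simply receives no cluster label. A minimal sketch of how unmatched codes on either side could be listed is shown below; the two toy frames are hypothetical stand-ins for the attribute table of `world` and for `cancer_death_rate_kmeans_cluster_map`, and the codes in them are chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the world map attributes and the cluster map.
world_codes = pd.DataFrame({"gu_a3": ["AFG", "ALB", "ATA"]})
cluster_map = pd.DataFrame(
    {
        "KMEANS_CLUSTER": [
            "HIGH_PRO_BRE_CER_STO_ESO_CAN",
            "HIGH_PAN_LUN_COL_LIV_CAN",
            "HIGH_PAN_LUN_COL_LIV_CAN",
        ],
        "CODE": ["AFG", "ALB", "AND"],
    }
)

# An outer merge with indicator=True records which side each row came from.
check = world_codes.merge(
    cluster_map, left_on="gu_a3", right_on="CODE", how="outer", indicator=True
)

# Codes present only in the map (would be drawn without a cluster label).
map_only = check.loc[check["_merge"] == "left_only", "gu_a3"].tolist()

# Codes present only in the data (would not be drawn at all).
data_only = check.loc[check["_merge"] == "right_only", "CODE"].tolist()

print("In map only:", map_only)
print("In data only:", data_only)
```

Running the same check on the real frames would show whether any clustered countries are missing from the map because their codes differ from the `gu_a3` values.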
##################################
# Consolidating the K-Means descriptors
# with the country codes for mapping
##################################
cancer_death_rate_kmeans_descriptor_map = pd.concat([cancer_death_rate_kmeans_clustering_descriptor,cancer_death_rate_filtered_row[['CODE']]], axis=1, join='inner')
cancer_death_rate_kmeans_descriptor_map.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | KMEANS_CLUSTER | CODE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AFG |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | HIGH_PAN_LUN_COL_LIV_CAN | ALB |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | HIGH_PAN_LUN_COL_LIV_CAN | DZA |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | HIGH_PAN_LUN_COL_LIV_CAN | AND |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AGO |
##################################
# Merging the world map GeoDataFrame
# with the K-Means descriptors using country codes
##################################
world_descriptor = world.merge(cancer_death_rate_kmeans_descriptor_map, left_on='gu_a3', right_on='CODE', how='left')
##################################
# Plotting the map by Pancreatic Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='PANCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "PANCAN"})
plt.title('PANCAN')
plt.show()
##################################
# Plotting the map by Lung Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='LUNCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "LUNCAN"})
plt.title('LUNCAN')
plt.show()
##################################
# Plotting the map by Colon Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='COLCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "COLCAN"})
plt.title('COLCAN')
plt.show()
##################################
# Plotting the map by Liver Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='LIVCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "LIVCAN"})
plt.title('LIVCAN')
plt.show()
##################################
# Plotting the map by Prostate Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='PROCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "PROCAN"})
plt.title('PROCAN')
plt.show()
##################################
# Plotting the map by Breast Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='BRECAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "BRECAN"})
plt.title('BRECAN')
plt.show()
##################################
# Plotting the map by Cervical Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='CERCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "CERCAN"})
plt.title('CERCAN')
plt.show()
##################################
# Plotting the map by Stomach Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='STOCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "STOCAN"})
plt.title('STOCAN')
plt.show()
##################################
# Plotting the map by Esophagus Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='ESOCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "ESOCAN"})
plt.title('ESOCAN')
plt.show()
##################################
# Consolidating the K-Means target descriptors
# with the country codes for mapping
##################################
cancer_death_rate_kmeans_target_map = pd.concat([cancer_death_rate_kmeans_clustering_target,cancer_death_rate_filtered_row[['CODE']]], axis=1, join='inner')
cancer_death_rate_kmeans_target_map.head()
| | KMEANS_CLUSTER | SMPREV | OWPREV | ACSHAR | CODE |
|---|---|---|---|---|---|
| 0 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -0.5405 | -1.4979 | -1.6782 | AFG |
| 1 | HIGH_PAN_LUN_COL_LIV_CAN | 0.5329 | 0.6090 | 0.4008 | ALB |
| 2 | HIGH_PAN_LUN_COL_LIV_CAN | -0.6438 | 0.9033 | -1.3345 | DZA |
| 3 | HIGH_PAN_LUN_COL_LIV_CAN | 1.1517 | 1.0213 | 1.1371 | AND |
| 4 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -1.0431 | -1.2574 | 0.3520 | AGO |
##################################
# Merging the world map GeoDataFrame
# with the K-Means target descriptors using country codes
##################################
world_target = world.merge(cancer_death_rate_kmeans_target_map, left_on='gu_a3', right_on='CODE', how='left')
##################################
# Plotting the map by Smoking Prevalence
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7))
world_target.boundary.plot(ax=ax, linewidth=1)
world_target.plot(column='SMPREV', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "SMPREV"})
plt.title('SMPREV')
plt.show()
##################################
# Plotting the map by Overweight Prevalence
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7))
world_target.boundary.plot(ax=ax, linewidth=1)
world_target.plot(column='OWPREV', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "OWPREV"})
plt.title('OWPREV')
plt.show()
##################################
# Plotting the map by Alcohol Consumption
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7))
world_target.boundary.plot(ax=ax, linewidth=1)
world_target.plot(column='ACSHAR', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "ACSHAR"})
plt.title('ACSHAR')
plt.show()
A detailed report was formulated documenting all the analysis steps and findings.
from IPython.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 15px; font-family: 'Trebuchet MS'; }</style>"))